Abstract: Today’s activities in cyber space are more connected than ever
before, driven by the ability to dynamically interact and share
information with a changing set of partners over a wide variety of
networks. To support dynamic sharing, computer systems and networks are
stood up on a continuous basis to support changing mission-critical
functionality. However, configuration of these systems remains a manual
activity, with misconfigurations staying undetected for extended periods,
unneeded systems remaining in place long after they are no longer needed,
and systems not getting updated to include the latest protections against
vulnerabilities. This situation provides a rich environment for targeted
cyber attacks that remain undetected for weeks to months and pose a
serious national security threat. To counter this threat, technologies
have started to emerge to provide continuous monitoring across any
network-attached device for the purpose of increasing resiliency by
identifying and then mitigating targeted attacks. For these technologies
to be effective, it is of utmost importance to avoid any inadvertent
increase in the attack surface of the monitored system. This paper
describes the security architecture of Gestalt, a next-generation cyber
information management platform that aims to increase resiliency by
providing ready and secure access to granular cyber event data available
across a network. Gestalt’s federated monitoring architecture is based on
the principles of strong isolation, least-privilege policies,
defense-in-depth, crypto-strong authentication and encryption, and
self-regeneration. Remote monitoring functionality is achieved through an
orchestrated workflow across a distributed set of components, linked via a
specialized secure communication protocol, that together enable unified
access to cyber observables in a secure and resilient way.
I. Introduction

System administrators and cyber defenders continue to face challenges in
securing systems in enterprise environments as attacks keep increasing in
the level of sophistication and as the number of connected systems keeps
increasing. To support and automate manual activities associated with
obtaining information about systems and taking corrective action in
response to suspicious activities, an increasing number of technologies
for remote monitoring are becoming available with the premise of
increasing resiliency by decreasing the time-to-detect and
time-to-mitigate targeted attacks.

Fig. 1. Attack Surface of an Exemplar Remote Monitoring System

While the functional benefit of new protocols and tools that support
continuous monitoring and incident response is clear, it is quite common
for these tools to fail on the security front by either (1) providing
inadequate security, e.g., by adding to the attack surface and thereby
enabling adversaries to remotely monitor or manage critical
infrastructure, or (2) requiring a very stringent set of security controls
that are prohibitively difficult to implement, thereby limiting adoption
in the marketplace. Versions 1 and 2 of the Simple Network Management
Protocol (SNMP) [1] provide inadequate security and are widely adopted,
while version 3 has started to provide acceptable security but has a
limited deployment footprint. WS-Security [2] and the Common Secure
Interoperability Protocol Version 2 (CSIv2) [3] are examples of protocols
that started out with complex security controls and did not achieve
anticipated market penetration and adoption.

As a motivating example, consider an attack progression in a remote
monitoring system built on a web services stack as displayed in Fig. 1.
The base system consists of a client application (left) interacting with
an application server (right), which in turn accesses internal state of
the monitored device. An adversary can start out by gathering sensitive
information through network sniffing (attack 1) and perform target
identification through port scanning and remote operating system (OS)
fingerprinting (attack 2). Once the list of open ports and the OS type are
determined, a variety of off-the-shelf attacks, such as SYN flooding [4],
can be employed against older versions of operating systems to cause
denial of service or privilege escalation on systems that have not been
hardened or patched appropriately. Advancing up the network stack, the
adversary can attack the TCP connection handling code of the event
listeners (attack 3) by establishing and maintaining a large number of TCP
connections [5], causing thread resource bottlenecks and denying service
to legitimate clients. A more sophisticated adversary can attack the Java
Virtual Machines (JVMs) on which the services and applications are running
(attack 4), e.g., by creating maliciously crafted serialized objects to
execute arbitrary code during de-serialization [6]. Similarly, creating
specific XML documents can cause SOAP processors to crash. Furthermore,
attackers may devise application-level attacks, such as flooding registry
services to deny access to legitimate clients by overloading registry
processing (attack 5), brute-forcing password combinations (attack 6), and
mapping out critical URLs that get returned through error pages and
modifying the input used to generate the errors to inject SQL, XPATH, or
XQUERY commands (attack 7). Once components are compromised, the adversary
can use them to corrupt the target and/or exfiltrate information,
including accessing sensitive data that should not be made available
remotely and disseminating it to unauthorized external receiver endpoints.

This paper describes an innovative framework for remote monitoring that
(1) strengthens overall security by limiting unintentional increases in
the resulting attack surface and (2) can operate in contested network
environments, including transient and high-latency network links. We argue
that such a remote monitoring framework is a key enabler for the larger
concepts of reactive and proactive cyber resiliency, as cyber decision
making is inevitably driven by sensor information capturing the effects of
both attacks and defender-initiated actions. The framework is currently
being developed under the Gestalt project in support of DARPA’s Integrated
Cyber Analysis System (ICAS) [7] program. The objective of Gestalt is to
provide federated access to a large, diverse set of cyber observables to
enable detection of targeted cyber attacks. Gestalt automatically
discovers available data sources, unifies access to observables via a
comprehensive common ontology, and automatically decomposes and federates
queries and semantically integrates the results. The implementation of the
framework is at a Technology Readiness Level (TRL) of 4, with basic
functionality tested in the development environment.

The Gestalt system eliminates tedious manual inspection by providing
access to all data sources on the network via a federated query interface.
Using a new Cyber Defense Language, a single query can access data
residing on multiple devices, across disparate device types and data
formats, and return the query results in a semantically integrated and
immediately useful format. Gestalt allows the cyber defender to focus on
the forensic data itself by abstracting away the actual methods and
techniques required to access that forensic data. Through its Semantic
Query Decomposition capabilities, Gestalt infers the types of data sources
that can be used to satisfy a given query, and identifies where instances
of those data source types can be found on the network. Next, it
dispatches native queries to the device containing each data-source
instance to process the request. The results are semantically integrated
and returned to the cyber defender. Gestalt provides a single interface to
the cyber defender, dramatically improving their effectiveness and
allowing them to focus their time and expertise on forensic analysis of
the results of their search queries, rather than on the laborious process
of data collection and processing.

The paper is organized as follows: Section II describes related work.
Section III provides a high-level overview of the security architecture in
relation to a threat model. Section IV dives into network-level security
arguments while Section V provides more details on process-level security
arguments. Section VI concludes the paper.

II. Related Work

The  remote  monitoring  framework  presented  in  this  paper 
relates  to  work  performed  in  cyber  event  monitoring,  network 
monitoring,  and  stream  processing/big  data  platforms. 

A number of commercially available solutions exist in the Security
Information and Event Management (SIEM) and cyber monitoring product space
today, including ArcSight [8] and the Host Based Security System (HBSS)
[9]. While Gestalt provides detailed access to the current system state,
SIEMs provide extended summary information at a coarse granularity.

A number of solutions exist for network and grid monitoring [10],
including Ganglia [11], Nagios [12], and Zabbix [13]. These systems
specialize in performance monitoring and provide operators with dashboard
views of the current availability state of the overall network system.
Similar to Gestalt, many of these systems perform monitoring through a
distributed set of nodes that report to a centralized monitoring
dashboard. Ganglia in particular uses daemons installed on all monitored
devices and meta-daemons that poll XML over TCP from daemons and
meta-daemons. While similar, Gestalt is designed with a stronger focus on
security rather than performance, including mandated use of TLS, no
listening sockets (analogous to meta-daemons), and process-level
isolation.

Finally, a number of big data platforms exist for distributed processing
of information. Splunk [14] [15] is a well-known instance of a big data
processing capability that makes it easy for cyber defenders to establish
correlations between disconnected pieces of text information through a
specialized query language. Unlike Gestalt, Splunk is based on an
information model that requires raw observables to be aggregated in a
central database before they can be queried. For instance, Splunk requires
forwarders to be installed on data source devices, which establish
outbound connections to indexers and communicate a large amount of data
over those connections. In contrast, Gestalt reaches into data sources in
a controlled way, leading to a much reduced attack surface.

III.  Security  Architecture 

To better understand the security implications of adding systems like
Gestalt to an already existing IT infrastructure, we first look at the
overall architecture of the resulting system. Fig. 2 shows the main
components involved in remote monitoring performed by cyber defenders (on
the right) of devices (on the left). The figure shows how the Gestalt
system introduces two primary system components: the Discovery and Query
Nodes (DQNs) and the Query Management Service (QMS).

The DQNs provide the interface between the devices on the network and the
Gestalt system. Rather than adding new protocols to be implemented by end
devices, the DQNs leverage existing protocols where possible (Fig. 2
left), including the Simple Network Management Protocol (SNMP) v3 [1],
SSH, and the Distributed Management Task Force (DMTF) Web Services for
Management (WS-MAN) [16] technologies, to communicate with individual
devices and to catalog the data sources each manages. The QMS provides the
main interface to cyber defenders and interacts with multiple DQNs through
query dispatch and response interpretation logic.

The overall resiliency argument for this architecture consists of two main
parts: network and process arguments. By constructing a sound network
argument, we ensure integrity and confidentiality of the data transmitted
over the network as well as a robust means for authenticating various
actors, including processes and humans. The network argument is
constructed using three different types of network protocols:

Device Access Protocols: In the cases where a single data source can be
reached by multiple access protocols, the DQN will automatically select
the strongest access protocol available by preferring authenticated and
encrypted protocols. In addition, the DQN will enforce policy control over
the choice of access protocols to block use of protocols that are known to
be unsafe and where the risk of using the protocol exceeds the benefits of
getting observables from the data source through the protocol. For
instance, the DQN might be configured to avoid communicating with TLS
endpoints that are vulnerable to the Heartbleed [17] attack vector in
order to protect the DQN’s client process from compromise. Finally, to
achieve availability, the DQN may fail over between acceptable access
methods to achieve visibility in degraded mode.
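
The selection, policy, and failover behavior described above can be
illustrated with a minimal sketch. It is not taken from the Gestalt
implementation: the AccessProtocol type, the two-property strength ranking,
and the caller-supplied attempt function are all assumptions made for
illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessProtocol:
    """Hypothetical description of one way to reach a data source."""
    name: str
    authenticated: bool
    encrypted: bool


def rank_protocols(available, blocked):
    """Return policy-permitted protocols, strongest first.

    Strength is approximated by counting the authenticated and encrypted
    properties; a real DQN would likely apply a richer policy.
    """
    permitted = [p for p in available if p.name not in blocked]
    return sorted(
        permitted,
        key=lambda p: p.authenticated + p.encrypted,
        reverse=True,
    )


def query_with_failover(available, blocked, attempt):
    """Try each permitted protocol in ranked order until one succeeds.

    `attempt` is a caller-supplied function (an assumption of this sketch)
    that returns a result or raises ConnectionError when the device cannot
    be reached over the given protocol.
    """
    for protocol in rank_protocols(available, blocked):
        try:
            return protocol.name, attempt(protocol)
        except ConnectionError:
            continue  # degrade to the next acceptable access method
    raise ConnectionError("no acceptable access protocol reached the device")
```

Keeping the block list separate from the ranking mirrors the text: policy
first removes protocols whose risk outweighs their benefit, and only the
remaining ones compete on strength.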

Gestalt Management Protocol (GMP): The GMP specifies interactions between
the QMS and the DQNs in a way that minimizes the attack surface in the
DQNs and ensures operation in contested network environments through
asynchronous polling semantics. Since this protocol is added to the
existing system, it is constructed with strong security controls and
principles in mind, including the use of TLS v1.2 [18] tied in with a
robust PKI, e.g., one maintained by DISA for the DoD. Firewalls restrict
allowable GMP communication to a dedicated QMS per DQN.

QMS Access Protocol: The QMS offers a RESTful [19] API that enables access
by modern Web Browsers and external applications. Traffic is protected at
the highest level supported by Web Browsers and external applications,
e.g., currently TLS v1.2 for a selected subset of browsers. Firewalls
restrict allowable communication to a set of known IP addresses.

The result of the network argument is a management protocol that can be
added to existing IT infrastructure as a solid foundation for other
resiliency techniques to build upon. The process argument for DQNs
implements resiliency at the application level through the following
means.

Process Isolation: The DQN is split into two main components - a
long-lived Manager process and multiple transient Bridge processes.

Process Rejuvenation: The Manager can temporally constrain the effects of
compromised Bridge processes by killing a process as soon as its intended
functionality is complete, where intended functionality can be defined
over a set of requests.

Adaptive Monitoring: The Manager controls the lifecycle of Bridges by
spawning them, observing their state, and killing them if they deviate
from the norm. Anomaly detection relies on simple statistical methods
based on variance analysis of usage patterns on underlying resources.

Process Restrictions: Mandatory policy enforcement is enabled (e.g., using
SELinux [20]) to explicitly allow the minimal set of interactions required
for the process to function, following a default-deny paradigm.

User Input Filtering: Any data received from data sources is filtered and
sanitized by transcribing it into a different representation format.
Filtering includes checking maximum data size, and sanitization includes
turning the inputs from their native representation format into RDF/XML.

The  process  arguments  for  the  QMS  are  similar  and  omitted 
from  this  paper  for  the  sake  of  brevity. 
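
As an illustration of how Process Rejuvenation and Adaptive Monitoring
might combine, the following minimal sketch flags a Bridge for termination
either when it has served its request budget or when a resource sample
deviates strongly from its history. The class name, the three-sigma
threshold, and the use of a single CPU metric are assumptions for
illustration; the paper only states that variance analysis of resource
usage is used.

```python
from statistics import mean, pstdev


def is_anomalous(history, sample, threshold=3.0):
    """Flag a sample lying more than `threshold` standard deviations
    from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate variance
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return sample != mu
    return abs(sample - mu) / sigma > threshold


class BridgeSupervisor:
    """Hypothetical bookkeeping a Manager might keep for one Bridge."""

    def __init__(self, max_requests):
        self.max_requests = max_requests  # rejuvenation budget
        self.handled = 0
        self.cpu_history = []

    def should_kill(self, cpu_sample):
        """Decide after each request whether the Bridge should be killed."""
        self.handled += 1
        anomalous = is_anomalous(self.cpu_history, cpu_sample)
        self.cpu_history.append(cpu_sample)
        return anomalous or self.handled >= self.max_requests
```

A Manager following this sketch would call should_kill once per completed
request and spawn a fresh Bridge whenever it returns True.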

IV.  The  Gestalt  Management  Protocol 

The GMP specifies the connection behaviors and message exchanges between
the QMS, the DQN Manager, and the DQN Bridge processes. The GMP was
designed based on the following goals aimed at keeping the protocol vendor
independent, secure, and implementable.

Type centric: GMP specifies the message structure rather than the
programming. This enables the use of Transmission Control Protocol
(TCP)/IP-based load balancers and also achieves central processing unit
(CPU) architecture, operating system, and programming language
independence.

Network friendly: GMP works through firewalls that perform Network Address
Translation (NAT) and optimizes network usage by virtue of using HTTP
response codes, avoiding unnecessary round trips.

Secure: GMP uses a mutually authenticated Public Key Infrastructure (PKI)
and TLS for encryption.

Pragmatic: GMP is designed to be implemented on top of technologies backed
by a strong commercial and open-source community.

The overall architecture of GMP is displayed in Fig. 3. The figure shows a
QMS process (QMS) on the right, a DQN Manager (MGR) process in the middle,
and a DQN Bridge (BRI) process on the left. The Manager requests commands
from the QMS and sends them to the Bridge processes for execution. The
Manager checks for response messages from the Bridge and sends them to the
QMS. The Bridge, Manager, and QMS implement a GMP-compliant interface in
addition to their internal functions.

The GMP specification defines interactions between QMS, MGR, and BRI. As
shown in Fig. 3, the QMS and BRI host HTTP servers, while the MGR
incorporates a library/component that participates as an HTTP client. The
MGR connects to the QMS by establishing a mutually authenticated TLS
connection to the QMS’s HTTP server. This methodology ensures the MGR is
not exposed to direct network attacks because it does not provide a
listening network socket. Listening network sockets open up a significant
amount of kernel code to remote attack because packets need to be read in
from the network and interpreted before authentication is performed. By
switching to outbound-only connections, the MGR not only minimizes the
resources set aside in its TCP/IP stack but also causes early failures if
responses do not match up with requests. Also note that the communication
between MGR and BRI is routed over the DQN’s trusted loopback network. The
Bridge must bind its HTTP server to the loopback network only. Therefore,
the DQN does not provide any externally reachable listening socket. Also,
since the loopback network is trusted, the connection between the MGR and
the BRI goes over plain TCP connections.

The reason for selecting HTTP between MGR and BRI over other local
communication means, e.g., Java RMI, is that (1) HTTP does not introduce
any other dependencies (such as registry components) that need to be
secured and (2) HTTP allows for dedicated connection-per-request
interaction patterns that trade some performance for increased security.
The reason for favoring a polling approach between MGR and QMS over a push
approach, e.g., Bidirectional-streams Over Synchronous HTTP (BOSH) [21],
is as follows. First, polling with a connection-per-request model works
better in disruptive environments, where constant network connectivity
cannot be assumed, e.g., monitoring of cyber assets close to the tactical
edge. Second, polling requests can be used as an active measurement of
connectivity between the clients and the server (like a heartbeat
protocol) to detect issues before data sharing is required, without the
need to add heartbeat protocols on top of HTTP. Third, limiting the
lifespan of active connections leads to a reduced attack surface. Finally,
the overhead associated with polling is small given the number of DQNs
involved and the strategic use of HTTP 204 response codes. Moreover, the
end-to-end latency requirement for queries is to reduce access time from
days to minutes, which makes sacrificing some end-to-end latency for
increased security and robustness worthwhile.
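
To make the polling rationale concrete, the following minimal sketch shows
one polling cycle from the Manager's point of view. The three callables
stand in for the HTTP GET of pending commands, command execution, and the
HTTP POST of the result; they are hypothetical placeholders rather than
part of the GMP specification, and transport details such as TLS are
deliberately left out.

```python
def poll_once(fetch_commands, execute, post_response):
    """Run one polling cycle and report what happened.

    fetch_commands() -> (status_code, body) models the outbound request
    for pending commands; execute(body) runs the command; and
    post_response(result) models sending the result back to the QMS.
    """
    try:
        status, body = fetch_commands()
    except ConnectionError:
        # A failed poll doubles as a connectivity measurement.
        return "unreachable"
    if status == 204:
        return "idle"  # no content: nothing to do until the next interval
    if status != 200:
        return "error"
    post_response(execute(body))
    return "handled"
```

Because every cycle is initiated by the Manager and completes on its own,
a lost link only costs the current cycle; the next interval simply tries
again.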

Note that if firewalls get in the way of allowing inbound connections to
the QMS, a different optional GMP interaction pattern can be developed and
deployed which includes a Gateway process placed into the QMS’s
Demilitarized Zone (DMZ). This Gateway process, shown in Fig. 4, allows
both the DQN and the QMS to make outbound-only connections through their
respective firewalls. The Gateway logic is simply to forward GMP messages
between the two endpoints. To protect against corruption of the Gateway
due to its exposed location in the DMZ, GMP messages may optionally be
signed using XML Signature and encrypted using XML Encryption.

Fig. 4. GMP deployment in a Demilitarized Zone (DMZ) of a Network
Operations Center (NOC)

TABLE I. GMP COMMANDS

Command                         Description

getStatus()                     Instructs the MGR to respond with a report
                                that identifies the current runtime state
                                of the MGR, including all of the runtime
                                state of the BRI it manages.

abortCommand(id)                Abort command identified by id.

setConfiguration                Apply configuration settings expressed in
(configuration)                 configuration XML document.

executeQuery                    Execute the query and return the results.
(SPARQLquery)

scheduleDiscovery               Sets up a schedule for running active
(min, hour, month,              discovery, one time or repeating. Note that
dom, dow)                       the scheduling is similar to the Linux or
                                UNIX “cron” command.

TABLE II. GMP RESPONSES

Response    Description

Ack         Acknowledges that the MGR has parsed a given commandMessage
            and will take action per the ackResponse attribute. The
            ackResponse is exactly one of: Working, Invalid, or
            Inappropriate. Working indicates that the MGR is about to
            begin processing the specified command block. Invalid
            indicates one or more parsing errors in the commandMessage.
            Inappropriate indicates that one or more commands in the
            commandMessage are inappropriate for that MGR, e.g.,
            unsupported commands.

Nack        When the MGR receives a commandMessage that it is unable to
            parse to retrieve a commandMessage@ID, a Nack message is
            returned.

Success     The MGR acknowledges that a given set of commands successfully
            executed. The result is one of (1) nothing, (2) a
            statusReport, or (3) a queryResult.

Failure     The MGR notifies the QMS that a given commandMessage failed to
            process successfully. Failure types supported include:
            •  unreachableError: Reachability problems to components
               critical for command execution
            •  storageExceededErrorType: Problems with persistence store
               overrun on the DQN
            •  invalidStateErrorType: Problems with exceptions triggered
               by command
            •  interruptedErrorType: Problems with commands timing out
            •  malformedContentErrorType: Problems with parsing data
               supplied by the QMS


A.  Connection  Management 

Communication  between  MGR  and  the  QMS  must  be  via 
HTTPS.  In  order  to  reduce  the  attack  risks  to  the  MGR,  the 
following  constraints  apply: 

•  All  communications  must  be  initiated  by  the  MGR, 
never  the  QMS.  With  respect  to  GMP,  the  MGR  is  a 
client  only  and  never  a  server. 

•  The MGR and QMS mutually authenticate all communications via
certificate exchange.

•  Both the MGR and QMS must ensure that the certificates of the other
device have not been revoked or expired.

The following interaction patterns are designed to optimize network usage:

•  For polling intervals in the minute range, TLS connections are created
in a dedicated manner for the poll request. For short-lived intervals
(seconds), TLS connections may be reused.

•  The QMS returns HTTP response 204 (no content) in case no commands are
available.
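
The connection constraints above can be approximated on the client side
with Python's standard ssl module. This is a minimal sketch under stated
assumptions, not the Gestalt implementation: the function name and the
file-path parameters are hypothetical, and revocation checking (which
needs CRL or OCSP data to be provisioned) is not shown.

```python
import ssl


def build_mgr_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Create a client-side TLS context suitable for mutual authentication.

    The file paths are hypothetical parameters. When they are omitted, the
    system trust store is used and no client certificate is presented.
    """
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # The server (the QMS in this setting) must present a valid,
    # unexpired certificate that matches its host name.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    if cert_file is not None:
        # Present the client's own certificate so the server can
        # authenticate it in turn.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

A context built this way is only ever used for outbound connections, which
matches the requirement that the MGR acts as a client and never as a
server.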

B.  Command  Management 

If  the  MGR  polls  the  commandURI  and  receives  an  HTTP 
204  message,  there  are  no  commands  for  that  MGR  at  that  time 
and  the  cycle  is  complete  until  the  next  polling  interval. 

Linking responses to commands over multiple HTTP connections is maintained
by repeated use of a UUID attached to the initial commandMessage. This
UUID is referred to as the commandMessage@ID and is referenced in each
responseMessage sent from the MGR to the QMS via the root_id element. The
one exception to the commandMessage@ID being equal to
responseMessage@root_id is when a nack response is sent, informing the QMS
that the MGR was unable to parse the content. Note that the
responseMessage contains another optional id element, called part_id,
which is used for two purposes. First, some commands, like the
scheduleDiscovery command, cause asynchronous processing to happen
multiple times, with a responseMessage being generated each time. In this
case, the root_id will point to the commandMessage in which the
scheduleDiscovery command was included, and each responseMessage will have
a unique part_id. The second use case involves a single query and response
exchange in which multiple responseMessages are returned, each one
containing a partial result set. Again, the root_id will point to the id
attribute of the commandMessage carrying the query, while the part_id
attribute is unique to the specific responseMessage.

Fig. 5. Message Overview

Similarly, each response chosen by the MGR to send to the QMS is wrapped
in a responseList. The MGR sends the responseMessage via HTTP POST to the
responseURI of the QMS. In the same way that responseMessages are linked
to commandMessages at the message level (through shared use of id values),
responses are linked to commands at the command level through shared use
of id attributes. Graphically, these XML structures can be depicted as
shown in Fig. 5.
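
The correlation scheme can be sketched as follows. The ResponseMessage
class is a simplified stand-in for the XML structure, with only the
root_id and part_id fields taken from the text; the payload field and the
grouping helper are assumptions added for illustration.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ResponseMessage:
    """Simplified stand-in for a GMP responseMessage (not the XML schema)."""
    root_id: str   # ID of the commandMessage being answered
    payload: str   # hypothetical content, e.g. a partial result set
    part_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def group_by_command(responses):
    """Collect the payloads of all responses per originating commandMessage."""
    grouped = defaultdict(list)
    for response in responses:
        grouped[response.root_id].append(response.payload)
    return dict(grouped)
```

Because each responseMessage carries its own part_id, a receiver following
this sketch can reassemble partial result sets that arrive over separate
HTTP connections and still tell the individual parts apart.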

TABLE  I  lists  the  current  set  of  commands,  and  TABLE  II 
lists  the  current  set  of  responses. 

V.  Process  Security  Arguments 

Given the strong security guarantees provided by the network protocols
described earlier, the next level of constructing a resiliency argument
deals with a threat model in which the adversary has compromised one of
the data sources that is being accessed by the DQN; see Fig. 6.

Fig. 6. Starting Point: Corrupted Data Source

For the purpose of evaluating potential threats and designing mitigations,
we created a threat model in the form of an attack tree, shown in Fig. 7.
From left to right, the attack tree decomposes the high-level goal of
“Exploiting Gestalt” into five specific attack branches as follows:

•  Learn from Queries, causing loss of confidentiality by misusing queries
issued by Gestalt for attack reconnaissance.

•  Escalate Privilege, causing loss of integrity in various Gestalt
components and loss of confidentiality for access credentials in the
process.

•  Deny Service, causing loss of availability.

•  Corrupt Data Source, causing loss of integrity of analysis results by
generating bad observables on the locally compromised data source.

•  Corrupt Processing, causing loss of integrity of analysis results by
externally spoofing data source observables.

Each branch is further refined into specific attacks. In addition, we
annotate mitigation mechanisms towards the leaves via specialized
“mitigated by” branches.

A.  Learn  from  Queries 

Starting from a compromised monitored device, an adversary observes
specific queries issued by the DQN against that device. For example, if
the DQN issues commands against binaries looking for a specific string S,
the adversary knows that attack binaries containing S will be detected
easily going forward and hence it would be prudent not to use S anymore if
the objective is to stay dormant. Even worse, the query may contain
information concerning other sensitive devices on the network, e.g., show
me all outbound connections from end systems to a highly sensitive server.
In this case, the adversary might learn information about the highly
sensitive server while holding privileges only on a Gestalt-monitored edge
device. In some sense, Gestalt would do reconnaissance on behalf of the
adversary in this case.

While the data aggregation problem is very hard to solve in
general (e.g., two pieces of unclassified information becoming
highly classified when combined), Gestalt mitigates such attacks
through the following means:

1.  The DQN can send a potentially large number of queries to the
device, hiding the real query among the set of all queries. The
benefit of hiding needs to be carefully balanced with the
increased load on the network and end system.

2.  The DQN can send a more general query to the monitored device,
and then subset the responses internal to the DQN. This pulls out
more data from the actual device into the DQN, which might expose
it a little more through centralization (something that ICAS is
explicitly trying to avoid).

Specific  choices  can  be  customized  based  on  what  is  known 
by  the  DQN  about  the  monitored  device  before  it  is  actually 
accessed.  For  instance,  if  a  device  looks  suspicious  based  on  its 
network behavior, it might make sense to employ either strategy
1 or 2 for this specific device, while maintaining a default
behavior  of  sending  specific  queries  to  devices  otherwise. 
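
The two strategies and the per-device choice can be sketched as follows. This is a minimal illustration under stated assumptions, not Gestalt's implementation: the function names, the string representation of queries, and the `run_query` callback are all hypothetical.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, str]  # hypothetical shape of one response record


def with_decoys(real_query: str, decoy_pool: List[str],
                n_decoys: int, rng: random.Random) -> List[str]:
    """Strategy 1: hide the real query among decoys, in shuffled order.

    A larger n_decoys hides the query better but adds load on the
    network and the end system.
    """
    candidates = [q for q in decoy_pool if q != real_query]
    batch = rng.sample(candidates, n_decoys) + [real_query]
    rng.shuffle(batch)
    return batch


def generalize_then_subset(run_query: Callable[[str], List[Row]],
                           general_query: str,
                           keep: Callable[[Row], bool]) -> List[Row]:
    """Strategy 2: send a broad query, then filter inside the DQN.

    The device only ever sees general_query; the predicate `keep`
    (the real question) never leaves the DQN.
    """
    return [row for row in run_query(general_query) if keep(row)]


def pick_strategy(device_is_suspicious: bool) -> str:
    """Default to specific queries; obfuscate only for suspicious devices."""
    return "decoys_or_general" if device_is_suspicious else "specific"
```

For example, a query for connections to one sensitive server could be issued as "all outbound connections" and filtered by destination address inside the DQN, so the monitored device never learns which destination is of interest.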

B.  Escalate  Privilege 

Since  we  assume  that  monitored  devices  are  compromised 
and  the  DQN  needs  to  interact  with  those  devices,  there  is  a 
clear  path  for  adversaries  to  spread  out  to  other  devices  through 
Gestalt.  One  attack  would  be  to  gain  access  to  a  larger  set  of 
credentials  used  in  the  DQN  to  gain  remote  access  to  other 
devices,  either  through  the  same  access  protocol  or  different 
access  protocols.  This  can  be  achieved,  for  instance,  by  sending 
some  input  to  the  data  interpreter  on  the  DQN  that  causes  a 
buffer  overflow.  Along  the  same  lines,  attacks  might  directly 
target the DQN functionality and metadata information to either
blind cyber defenders or to misdirect efforts. Finally, the
attacker’s  next  logical  step  would  be  escalating  privileges  to 
the  QMS,  and  from  there,  compromising  the  browser  used  by 
cyber  defenders  to  access  the  QMS. 

Gestalt mitigates these attacks using a strong containment
strategy throughout the DQN, isolating adapters that interact
directly with the devices from extractors and management
components. Isolation is done on a per-process basis, with custom
SELinux policies written for each process that restrict access
down to individual files and network resources. Persistent storage
is also fragmented into per-process files (e.g., key stores), a
metadata index database, and a configuration store. In addition,
interactions between components are controlled by
application-level filters for user input validation. Corruption is
limited by process rejuvenation techniques, e.g., starting new
processes and limiting the amount of time they are allowed to
linger before disappearing after not being used. Depending on the
level of security paranoia required to interact with devices, the
following configurations for adapter processes are possible:

•  Per-device adapter processes, with dedicated access to only the
credentials available to access that device,

•  Per-domain adapter processes, where a domain can be a grouping
over devices along administrative domains,

•  Per-protocol adapter processes, where all SNMP devices are
handled through a single SNMP adapter instance with a single
credential file.
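
The trade-off between these three configurations can be made concrete with a small sketch. It only models which devices would share one adapter process (and therefore one credential set); the `Device` fields, the `Granularity` enum, and the function names are illustrative assumptions, not part of Gestalt.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Granularity(Enum):
    PER_DEVICE = "device"
    PER_DOMAIN = "domain"
    PER_PROTOCOL = "protocol"


@dataclass(frozen=True)
class Device:
    name: str
    domain: str    # administrative domain (hypothetical field)
    protocol: str  # access protocol, e.g. "snmp" or "ssh"


def adapter_key(device: Device, granularity: Granularity) -> str:
    """Name of the adapter process that would handle this device."""
    if granularity is Granularity.PER_DEVICE:
        return f"device:{device.name}"
    if granularity is Granularity.PER_DOMAIN:
        return f"domain:{device.domain}"
    return f"protocol:{device.protocol}"


def plan_adapters(devices: Iterable[Device],
                  granularity: Granularity) -> Dict[str, List[str]]:
    """Map each adapter process to the devices whose credentials it holds.

    The size of each list is the credential exposure if that single
    adapter process is compromised.
    """
    plan: Dict[str, List[str]] = defaultdict(list)
    for device in devices:
        plan[adapter_key(device, granularity)].append(device.name)
    return dict(plan)
```

Per-device grouping gives the smallest exposure per compromised process at the cost of the most processes; per-protocol grouping is the cheapest to run but exposes every credential for that protocol at once.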

Finally, the DQN host itself is installed and administered using
best-practice secure host techniques, exposes only those services
absolutely necessary for the operation and maintenance of the
system, and uses only the highest level of security applicable for
those services (e.g., public-key credentials for authentication,
encrypted network protocols, IP source address filtering for new
connections, and so forth).

C.  Deny  Service 

With the specific attack objective to cause loss of availability,
adversaries might crash various main components (DQN,
QMS)  or  sub-components  of  them  (adapter  processes).  Another 
way  to  deny  service  is  to  overload  shared  resources,  e.g.,  return 
a  large  amount  of  data  to  the  DQN  to  cause  out-of-memory 
exceptions.  A  more  intricate  version  of  causing  denial  on  the 
discovery  component  of  the  DQN  is  to  create  conditions  that 
cause  the  discovery  algorithm  to  spin  out  of  control,  e.g.,  by 
creating  unexpected  loops  in  observables. 

Gestalt  mitigates  these  attacks  by  explicitly  and  actively 
managing  the  lifecycle  of  processes  and  jobs  that  these 
processes  need  to  perform.  Processes  are  monitored  for  activity 
and  functionality,  and  information  about  problems  is  relayed 
back  from  the  DQN  to  the  QMS.  Query  and  discovery  jobs  are 
explicitly  tracked  through  IDs  and  can  be  started,  stopped,  and 
referred  to  for  fetching  completion  results.  These  jobs  will  also 
be  granted  access  to  only  those  resources  (disk,  memory,  etc.) 
deemed  necessary  for  their  functioning  and  will  be  terminated 
before  a  resource  exhaustion  issue  could  affect  the  operation  of 
the  overall  system. 
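
A minimal sketch of ID-based job tracking with a per-job resource budget is shown below. It assumes a single in-memory manager and uses result size as the only budgeted resource; the class names, state names, and byte budget are hypothetical and stand in for whatever disk and memory limits an actual deployment enforces.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: int
    max_result_bytes: int  # budget granted to this job
    state: str = "running"  # running | stopped | done | terminated
    results: List[bytes] = field(default_factory=list)
    used_bytes: int = 0


class JobManager:
    """Tracks query and discovery jobs by ID and enforces their budgets."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._jobs: Dict[int, Job] = {}

    def start(self, max_result_bytes: int) -> int:
        job = Job(next(self._ids), max_result_bytes)
        self._jobs[job.job_id] = job
        return job.job_id

    def add_result(self, job_id: int, chunk: bytes) -> bool:
        """Accept a result chunk, or terminate the job if over budget."""
        job = self._jobs[job_id]
        if job.state != "running":
            return False
        if job.used_bytes + len(chunk) > job.max_result_bytes:
            # Terminate before the oversized response is stored.
            job.state = "terminated"
            return False
        job.results.append(chunk)
        job.used_bytes += len(chunk)
        return True

    def stop(self, job_id: int) -> None:
        if self._jobs[job_id].state == "running":
            self._jobs[job_id].state = "stopped"

    def finish(self, job_id: int) -> None:
        if self._jobs[job_id].state == "running":
            self._jobs[job_id].state = "done"

    def status(self, job_id: int) -> str:
        return self._jobs[job_id].state

    def fetch(self, job_id: int) -> Optional[List[bytes]]:
        """Return completion results only for jobs that finished normally."""
        job = self._jobs[job_id]
        return list(job.results) if job.state == "done" else None
```

In this sketch a device that returns an oversized response causes only its own job to be terminated; the status remains queryable by ID so the problem can be relayed onward.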

D.  Corrupt  Data  Source 

This part of the attack tree captures the fact that devices can
lie to the DQN if the reporting mechanism used by the device is
also corrupted (which is generally assumed). Put this way, state
that the DQN obtains from the corrupted device itself needs to be
treated differently from information obtained from other devices
(e.g., a NIDS) about that device.

Gestalt mitigates these attacks by keeping track of provenance
information associated with observables to enable higher-level
reasoning and deconfliction by cyber defenders. Provenance
information is assembled by various components of Gestalt,
including the DQN and the QMS, in a crypto-strong way that allows
for integrity checks in the QMS.
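
One way such integrity-checkable provenance could look is a chain of keyed hashes, where each component appends a record whose tag covers its own payload and the previous tag. The paper does not specify the mechanism, so the sketch below is an assumption throughout: it uses HMAC-SHA256 with per-component shared keys (a signature scheme would be an equally plausible choice), and the record fields and function names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, List


def _tag(key: bytes, component: str, payload: dict, prev_tag: str) -> str:
    """HMAC over the component name, its payload, and the previous tag."""
    msg = json.dumps(
        {"component": component, "payload": payload, "prev": prev_tag},
        sort_keys=True,
    ).encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def append_record(chain: List[dict], component: str,
                  key: bytes, payload: dict) -> None:
    """Called by each component (e.g., adapter, DQN) that handles the data."""
    prev = chain[-1]["tag"] if chain else ""
    chain.append({
        "component": component,
        "payload": payload,
        "tag": _tag(key, component, payload, prev),
    })


def verify_chain(chain: List[dict], keys: Dict[str, bytes]) -> bool:
    """Integrity check, assumed here to run in the QMS."""
    prev = ""
    for record in chain:
        key = keys.get(record["component"])
        if key is None:
            return False
        expected = _tag(key, record["component"], record["payload"], prev)
        if not hmac.compare_digest(expected, record["tag"]):
            return False
        prev = record["tag"]
    return True
```

A payload field such as a "self-reported" versus "third-party" source marker would let defenders weigh a device's own claims differently from what a NIDS reports about it, and any later alteration of that marker would fail verification.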

E.  Corrupt  Processing 

The overall goal of this attack is to use Gestalt against the
defender by corrupting its functionality in various ways. For
instance, traffic replay (at various points, not just the device
access protocols) might make it look to Gestalt as if the world
either did not change (although it did) or changed significantly
(although it did not). The goal of the first case is to hide
within the Gestalt monitoring framework, while the goal of the
second is to cause cyber defenders to go down rat holes that take
focus and attention away from the real issues at hand. An
interesting case involves corruption of a form that causes the
Gestalt operators to actively do work on behalf of the adversary.
If a DQN is corrupted, can it, by reporting certain values back to
the QMS, get the defenders to execute queries on other DQNs and
then feed the results of those queries back into the corrupted
DQN? Such a setup could also be used for an amplified
denial-of-service attack, where the results reported back from a
single corrupted DQN could trigger a wide range of QMS-initiated
interactions with other DQNs and devices.

VI.  Conclusion  and  Next  Steps 

The current enterprise IT infrastructure remains vulnerable to
targeted cyber attacks that stay undetected for months, despite
the fact that a variety of low-level observables are available,
audited, and recorded. This establishes the need for a remote
monitoring framework that can integrate with existing data sources
in a secure manner, dispatch queries from a unified presentation
to specific data sources at hand, and securely integrate results
back into a consistent and reliable cyber operational picture.

This paper describes the security and resiliency architecture for
the remote monitoring framework we are currently developing under
the DARPA ICAS program. It describes how this framework
strategically combines strong network resiliency and protection
with process-level resiliency techniques, including isolation,
rejuvenation, and adaptive monitoring/response. The overall claim
is that designing security and resiliency into the architecture
from the beginning, in a bottom-up way, supports the argument that
the value of adding the new monitoring framework to an already
existing IT infrastructure outweighs the risks associated with
increasing the attack surface.

Going forward, we plan to participate in an external evaluation of
the security argument performed by an adversarial partner as part
of the ICAS program. In addition, we plan to provide
implementations for an increasing set of mitigations outlined in
the attack tree, and to further refine the tree by linking in
integration tests via “verified by” leaf branches.